NSF PAR Search | NSF Public Access Repository

Reducing Confusion in Active Learning for Part-Of-Speech Tagging

https://doi.org/10.1162/tacl_a_00350

Chaudhary, Aditi; Anastasopoulos, Antonios; Sheikh, Zaid; Neubig, Graham (February 2021, Transactions of the Association for Computational Linguistics)
Active learning (AL) uses a data selection algorithm to choose useful training samples and thereby minimize annotation cost. It is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, where annotating these instances may reduce a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we found, surprisingly, that even in an oracle scenario where we know the true uncertainty of predictions, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances that maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper calibration of models, which we ensure through cross-view training, and analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution. The code is publicly released.
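
The idea of selecting instances by the confusion between pairs of output tags can be illustrated with a minimal sketch. This is not the authors' released implementation: the function name `select_by_tag_confusion`, the input format (per-token tag probabilities from some already-trained tagger), and the scoring rule (summing the probability of the runner-up tag per tag pair) are all assumptions made for illustration.

```python
from collections import defaultdict


def select_by_tag_confusion(marginals, budget):
    """Pick tokens from the most-confused (predicted tag, runner-up tag) pair.

    marginals: list of (token_id, {tag: probability}) with at least two tags
               per token (hypothetical input format, not from the paper).
    budget:    maximum number of token ids to return for annotation.
    """
    pair_mass = defaultdict(float)    # total runner-up probability per tag pair
    pair_members = defaultdict(list)  # (runner-up probability, token_id) per pair

    for token_id, dist in marginals:
        ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
        top_tag = ranked[0][0]
        second_tag, second_p = ranked[1]
        pair = (top_tag, second_tag)
        pair_mass[pair] += second_p
        pair_members[pair].append((second_p, token_id))

    # The pair with the largest accumulated runner-up probability is treated
    # as the most confused one (an illustrative scoring choice).
    worst_pair = max(pair_mass, key=pair_mass.get)
    ranked_members = sorted(pair_members[worst_pair], reverse=True)
    chosen = [token_id for _, token_id in ranked_members[:budget]]
    return worst_pair, chosen


if __name__ == "__main__":
    example = [
        ("t1", {"NOUN": 0.55, "VERB": 0.40, "ADJ": 0.05}),
        ("t2", {"NOUN": 0.50, "VERB": 0.45, "ADJ": 0.05}),
        ("t3", {"ADJ": 0.60, "NOUN": 0.35, "VERB": 0.05}),
        ("t4", {"VERB": 0.95, "NOUN": 0.04, "ADJ": 0.01}),
    ]
    print(select_by_tag_confusion(example, budget=2))
    # prints (('NOUN', 'VERB'), ['t2', 't1'])
```

In this toy input the NOUN/VERB pair accumulates the most runner-up probability, so the two tokens where the tagger hesitates between those tags are returned first. The abstract also stresses model calibration, which this sketch simply takes for granted in the input probabilities.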
Full Text Available
When is Wall a Pared and when a Muro?: Extracting Rules Governing Lexical Selection

https://doi.org/10.18653/v1/2021.emnlp-main.553

Chaudhary, Aditi; Yin, Kayo; Anastasopoulos, Antonios; Neubig, Graham (January 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing)

Full Text Available
Evaluating the Morphosyntactic Well-formedness of Generated Texts

https://doi.org/10.18653/v1/2021.emnlp-main.570

Pratapa, Adithya; Anastasopoulos, Antonios; Rijhwani, Shruti; Chaudhary, Aditi; Mortensen, David R.; Neubig, Graham; Tsvetkov, Yulia (January 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing)

Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L’AMBRE – a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
Full Text Available
Automatic Extraction of Rules Governing Morphological Agreement

https://doi.org/10.18653/v1/2020.emnlp-main.422

Chaudhary, Aditi; Anastasopoulos, Antonios; Pratapa, Adithya; Mortensen, David R.; Sheikh, Zaid; Tsvetkov, Yulia; Neubig, Graham (January 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP))
Full Text Available
CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology

Chaudhary, Aditi; Salesky, Elizabeth; Bhat, Gayatri; Mortensen, David R.; Carbonell, Jaime G.; Tsvetkov, Yulia (August 2019, SIGMORPHON 2019: 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology)

This paper presents the submission by the CMU-01 team to SIGMORPHON 2019 Task 2, Morphological Analysis and Lemmatization in Context. The task requires producing the lemma and morpho-syntactic description of each token in a sequence, for 107 treebanks. We approach it with a hierarchical neural conditional random field (CRF) model that predicts each coarse-grained feature (e.g., POS, Case) independently. However, most treebanks are under-resourced, which makes it challenging to train deep neural models for them. Hence, we propose a multilingual transfer training regime in which we transfer from multiple related languages that share similar typology.
Full Text Available